Hybrid N-gram Probability Estimation in Morphologically Rich Languages

Authors

  • Hyopil Shin
  • Hyun-Jo You
Abstract

N-gram language modeling is essential in natural language processing and speech processing. In morphologically rich languages such as Korean, a word usually consists of at least one lemma (content morpheme) and functional morphemes that represent various grammatical functions. Most word forms in Korean, however, suffer from sparse data and zero probabilities because of their complex morpheme combinations. Morpheme-based N-gram modeling is therefore widely used instead of word-sequence modeling. In this paper, we contend that a morpheme-based N-gram is an inefficient approach to language modeling because it inevitably approximates the probability of unnecessary morpheme sequences, so the longer the sequences, the lower the probability estimates. We propose a hybrid method that joins word-based and morpheme-based language modeling. The new method can also be regarded as an extension of a class-based measurement. Our experimental results show that the method produces better probability estimates than the morpheme-based measurement.
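
The length effect described in the abstract can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the toy corpus, the romanized morpheme segmentation, the add-one smoothing, and in particular the `hybrid_log_prob` rule (use the word-level estimate for a seen word bigram, otherwise fall back to morpheme bigrams). That rule is a hypothetical stand-in and not the formula proposed in the paper, and it does not yield a normalized distribution.

```python
import math
from collections import Counter

# Toy corpus invented for illustration: a sentence is a list of words,
# and each word is a list of (hypothetical, romanized) morphemes.
corpus = [
    [["hakkyo", "ey"], ["ka", "ss", "ta"]],
    [["hakkyo", "ey"], ["o", "ss", "ta"]],
    [["cip", "ey"], ["ka", "ss", "ta"]],
]

def as_words(sent):
    """View a sentence as a sequence of whole-word tokens."""
    return ["+".join(morphs) for morphs in sent]

def as_morphs(sent):
    """View a sentence as a flat sequence of morpheme tokens."""
    return [m for morphs in sent for m in morphs]

def bigram_counts(sequences):
    """Count bigram histories and bigrams, with <s> as the start symbol."""
    hist, bi = Counter(), Counter()
    for seq in sequences:
        tokens = ["<s>"] + seq
        for prev, cur in zip(tokens, tokens[1:]):
            hist[prev] += 1
            bi[(prev, cur)] += 1
    return hist, bi

def smoothed(prev, cur, hist, bi, vocab_size):
    """Add-one smoothed bigram probability P(cur | prev)."""
    return (bi[(prev, cur)] + 1) / (hist[prev] + vocab_size)

def log_prob(seq, hist, bi, vocab_size):
    """Log-probability of a token sequence under a bigram model."""
    tokens = ["<s>"] + seq
    return sum(math.log(smoothed(p, c, hist, bi, vocab_size))
               for p, c in zip(tokens, tokens[1:]))

w_hist, w_bi = bigram_counts([as_words(s) for s in corpus])
m_hist, m_bi = bigram_counts([as_morphs(s) for s in corpus])
w_vocab = len({t for s in corpus for t in as_words(s)})
m_vocab = len({t for s in corpus for t in as_morphs(s)})

def hybrid_log_prob(sent):
    """Hypothetical hybrid rule (an assumption, not the paper's method):
    word-level estimate for seen word bigrams, morpheme-level otherwise."""
    total = 0.0
    prev_word, prev_morph = "<s>", "<s>"
    for morphs in sent:
        word = "+".join(morphs)
        if w_bi[(prev_word, word)] > 0:
            total += math.log(smoothed(prev_word, word, w_hist, w_bi, w_vocab))
        else:
            for m in morphs:
                total += math.log(smoothed(prev_morph, m, m_hist, m_bi, m_vocab))
                prev_morph = m
        prev_word, prev_morph = word, morphs[-1]
    return total

seen = corpus[0]                                 # two words, five morphemes
unseen = [["cip", "ey"], ["o", "ss", "ta"]]      # word bigram absent from corpus

word_lp = log_prob(as_words(seen), w_hist, w_bi, w_vocab)
morph_lp = log_prob(as_morphs(seen), m_hist, m_bi, m_vocab)
hybrid_unseen = hybrid_log_prob(unseen)
morph_unseen = log_prob(as_morphs(unseen), m_hist, m_bi, m_vocab)

print(f"seen sentence:   word={word_lp:.3f}  morpheme={morph_lp:.3f}")
print(f"unseen sentence: hybrid={hybrid_unseen:.3f}  morpheme={morph_unseen:.3f}")
```

On this toy data the morpheme-based score for the same sentence is lower than the word-based one simply because five conditional probabilities are multiplied instead of two. The two models range over different token vocabularies, so the comparison is indicative only.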

Similar resources

Modeling Morphologically Rich Languages Using Split Words and Unstructured Dependencies

We experiment with splitting words into their stem and suffix components for modeling morphologically rich languages. We show that using a morphological analyzer and disambiguator results in a significant perplexity reduction in Turkish. We present flexible n-gram models, FlexGrams, which assume that the n−1 tokens that determine the probability of a given token can be chosen anywhere in the se...

Morpheme level hierarchical Pitman-Yor class-based language models for LVCSR of morphologically rich languages

Performing large vocabulary continuous speech recognition (LVCSR) for morphologically rich languages is considered a challenging task. The morphological richness of such languages leads to high out-of-vocabulary (OOV) rates and poor language model (LM) probabilities. In this case, the use of morphemes has been shown to increase the lexical coverage and lower the LM perplexity. Another approach ...

Valence, arousal and dominance estimation for English, German, Greek, Portuguese and Spanish lexica using semantic models

We propose and evaluate the use of an affective-semantic model to expand the affective lexica of German, Greek, English, Spanish and Portuguese. Motivated by the assumption that semantic similarity implies affective similarity, we use word level semantic similarity scores as semantic features to estimate their corresponding affective scores. Various context-based semantic similarity metrics are...

Sub-word Based Language Modeling for Amharic

This paper presents sub-word based language models for Amharic, a morphologically rich and under-resourced language. The language models have been developed (using the open-source language modeling toolkit SRILM) with different n-gram orders (2 to 5) and smoothing techniques. Among the developed models, the best performing one is a 5-gram model with modified Kneser-Ney smoothing and with interpola...

Tackling Sparse Data Issue in Machine Translation Evaluation

We illustrate and explain problems of n-gram-based machine translation (MT) metrics (e.g. BLEU) when applied to morphologically rich languages such as Czech. A novel metric, SemPOS, based on the deep-syntactic representation of the sentence, tackles the issue and retains the performance for translation to English as well.

Publication date: 2009